Summary


The issue of creating overdispersion in a given one parameter one dimensional exponential
family,  by  extending  it  to  a  two  parameter  exponential  family  with  the  same  support,  is 
considered.  An  easily  verifiable  sufficient  condition  for  this  is  derived.  It  is  shown  that  a  large 
class  of  families  satisfy  this  condition  and  that  this  class  includes  Efron’s  (1986)  and  Lindsay’s 
(1986)  family  as  special  cases.  This  class  is  also  closely  related  to  Jorgensen’s  (1988) 
Exponential  Dispersion  Models.  UMP  unbiased  tests  for  testing  overdispersion  are  exhibited 
and  it  is  shown  that  in  this  context  Cox’s  overdispersion  test  is  a  special  case.  A  graphical 
display  is  developed  to  select  a  family  of  the  general  class  which  should  be  used  with  a  given 
set  of  data  to  overdisperse  a  given  target  one  parameter  family.  Two  real  data  illustrations  are 
given. 

Keywords: Exponential family, overdispersion, exponential dispersion model, weighted least squares.

1.  Introduction

Overdispersion  as  an  issue  has  been  recognized  by  data  analysts  for  many  years.  Samples 
are  often  found  to  be  too  heterogeneous  to  be  explained  by  a  one  parameter  family  of  models  in 
the  sense  that  the  implicit  mean-variance  relationship  in  such  a  family  is  violated  by  the  data;  the 
sample  variance  is  large  compared  with  that  predicted  by  inserting  the  sample  mean  into  the 
mean-variance  relationship.  The  natural  remedy  is  to  consider  a  larger  collection  of  models,  say 
a  two  parameter  family.  Historically  the  most  frequently  used  means  of  doing  this  has  been  to 
mix  the  one  parameter  family  with  a  two  parameter  family  creating  a  two  parameter  marginal 
mixture  family  for  the  data.  At  the  same  value  of  the  mean  such  mixing  typically  inflates  the 
model  variance.  In  fact  Shaked  (1980)  shows  that  for  a  one  parameter  exponential  family  this  is 
necessarily  the  case.  Cox  (1983)  noted  that  for  modest  amounts  of  overdispersion  a  full 
specification  of  the  mixing  distribution  was  unnecessary;  only  its  mean  and  variance  (two 
parameters)  are  needed. 

Does  the  assumption  that  the  one  parameter  family  is  an  exponential  family,  whence  the 
mean-variance  relationship  (usually  called  the  variance  function)  is  immediately  available 
through  the  normalizing  function,  enhance  our  ability  to  do  modeling  and  inference  for 
overdispersion?  We  argue  that  the  answer  is  yes  by  developing  a  general  class  of  two  parameter 
exponential  families  which  are  overdispersed  relative  to  a  given  one  parameter  exponential 
family.  Customarily,  the  one-parameter  exponential  family  has  been  mixed  with  a  two 
parameter  conjugate  distribution  (see  e.g.,  Morris,  1982).  The  resulting  overdispersed  family  of 
mixture  models  will  often  be  awkward  to  work  with  since  it  will  not  be  an  exponential  family. 
Hence  straightforward  optimal  inference  may  be  sacrificed.  To  model  overdispersion  Efron 
(1986) creates a so-called double exponential family using Hoeffding’s representation of a one
parameter exponential family through the Kullback-Leibler distance. In fact this family is a two
parameter  exponential  family  whose  normalizing  function  has  a  very  simple  approximate  form. 
Efron’s  arguments  are  all  asymptotic  relying  upon  the  assumption  that  each  observation  is  itself 
an  average  of  a  large  number  of  observations.  For  any  fixed  sample  size  Efron’s  family  is  a 
special  case  of  ours  and  therefore  asymptotics  are  not  needed  to  argue  for  overdispersion. 
Lindsay (1986) addresses the slightly different question of whether mixing a one parameter
exponential family can produce a two parameter exponential family. He shows that using
so-called reweighted infinitely divisible families as mixing distributions will achieve this. Again,
our  general  two-parameter  family  includes  Lindsay’s.  Within  Jorgensen  (1987)  the  one 
parameter  exponential  family  is  extended  to  a  two  parameter  class  of  distributions  which  is 
called  an  exponential  dispersion  model  (EDM).  Extending  our  two  parameter  family  to  account 
for  sample  size  yields  a  class  of  models  having  two  interesting  features.  First  this  class  is 
overdispersed  relative  to  Jorgensen’s  two  parameter  EDM.  Second  this  class  is  itself 
approximately an EDM.

In  section  2  we  offer  our  basic  results.  In  section  3  we  comment  on  sampling  models  by 
introducing  sample  size  into  our  family.  Finally  in  section  4  we  develop  a  UMP  unbiased  test 
for overdispersion, propose a graphical display and illustrate our methods with two examples.

2.  Basic  Results 

Consider the two-parameter exponential family $\mathcal{P}$, having densities, with respect to some
$\sigma$-finite measure $G$, of the form

$p_{\theta,\tau}(y) = \exp\{\theta y + \tau T(y) - \rho(\theta,\tau)\}$  (2.1)

We write expectations under this density using the notation $E(S(Y) \mid \theta, \tau)$. We presume that the
natural parameter space contains a two-dimensional rectangle which, by translation, can be taken
to contain $\tau = 0$. We denote the one parameter model at $\tau = 0$ by

$p_{\theta}(y) = \exp\{\theta y - \chi(\theta)\}$  (2.2)

In (2.2), the “normalizing function” $\chi(\theta)$ is, in fact, the log moment generating function of $G$
with $\chi'(\theta) \equiv \mu$ the mean of $Y$, and $\chi''(\theta)$ the variance. Since $\chi'(\cdot)$ is strictly increasing, it has a
unique strictly increasing inverse function, say, $\theta = \eta(\mu)$.

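In (2.1) let $\rho^{(r,s)} = \partial^{r+s}\rho / \partial\theta^{r}\,\partial\tau^{s}$, $r,s \ge 0$. We recall that $\rho^{(1,0)} = E(Y \mid \theta,\tau)$,
$\rho^{(2,0)} = \mathrm{var}(Y \mid \theta,\tau)$, $\rho^{(2,1)} = E\{(Y - E(Y))^2 (T(Y) - E\,T(Y)) \mid \theta,\tau\}$, etc. In the sequel we
suppose in (2.2) that corresponding to $\theta_0$ the mean is $\mu_0$. In (2.1) define $\theta_{\mu_0}(\tau)$ implicitly by
$E(Y \mid \theta,\tau) = \mu_0$, i.e., $\theta_{\mu_0}(\tau)$ indexes the curved subfamily $\mathcal{P}_{\mu_0}$ of (2.1) where the mean
is $\mu_0$. Implicit differentiation yields
$\theta'_{\mu_0}(\tau) = -\,(\partial E(Y \mid \theta,\tau)/\partial\tau) \,/\, (\partial E(Y \mid \theta,\tau)/\partial\theta)$, evaluated at $\theta = \theta_{\mu_0}(\tau)$, i.e.,

$\theta'_{\mu_0}(\tau) = -\,\mathrm{cov}(Y, T \mid \theta_{\mu_0}(\tau), \tau) \,/\, \mathrm{var}(Y \mid \theta_{\mu_0}(\tau), \tau)$  (2.3)

Thus, by the association inequality, if $T$ is monotone then $\theta_{\mu_0}(\tau)$ is.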

As a definition the family of models (2.1) is said to be overdispersed relative to the family of
models (2.2) if, keeping the means fixed in (2.1), the variance increases in $\tau$. More precisely, we
mean that for each subfamily $\mathcal{P}_{\mu_0}$, $\mathrm{var}(Y)$ increases in $\tau$. Defining $g_{\mu_0}(\tau) = E(Y^2 \mid \theta_{\mu_0}(\tau), \tau)$, a
sufficient condition for such overdispersion is $g'_{\mu_0}(\tau) \ge 0$ for $\tau \ge 0$. Underdispersion is defined
by requiring $\mathrm{var}(Y)$ decreasing in $\tau$ for each $\mathcal{P}_{\mu_0}$, with sufficient condition $g'_{\mu_0}(\tau) < 0$ for $\tau > 0$.
Underdispersed models rarely occur in practice.

Lemma 2.1 provides a convenient analytic expression for $g'_{\mu_0}(\tau)$.

Lemma  2.1 

$g'_{\mu_0}(\tau) = \mathrm{cov}(Y^2, T) - \mathrm{cov}(Y, Y^2)\,\mathrm{cov}(Y, T)/\mathrm{var}(Y)$  (2.4)

where expectations are taken at $(\theta_{\mu_0}(\tau), \tau)$ in (2.4) and in the proof.

Proof: Differentiating under the integral sign we have

$g'_{\mu_0}(\tau) = \theta'_{\mu_0}(\tau)\,E(Y^3) + E(Y^2 T(Y)) - (d\rho(\theta_{\mu_0}(\tau), \tau)/d\tau)\,E(Y^2)$. Using (2.3) and the fact that
$d\rho(\theta_{\mu_0}(\tau), \tau)/d\tau = \theta'_{\mu_0}(\tau)\,E(Y) + E\,T(Y)$, simple manipulation yields (2.4).

We note that for arbitrary members of $\mathcal{P}$ the right hand side of (2.4) may be expressed in
terms of derivatives of $\rho$:

$\mathrm{cov}(Y^2, T) - \mathrm{cov}(Y, Y^2)\,\mathrm{cov}(Y, T)/\mathrm{var}(Y) = \rho^{(2,1)} - \rho^{(3,0)}\,(\rho^{(1,1)}/\rho^{(2,0)})$  (2.5)

We  now  state  our  basic  result. 

Theorem  2.1:  A  sufficient  condition  that  family  (2.1)  is  overdispersed  relative  to  family 
(2.2)  is  that  T(y)  is  convex. 

The proof of the result follows directly from Lemma 2.2, taken from Dalal, Kemperman and
Mallows (1988).

Lemma 2.2: Let $S_1$ and $S_2$ be both convex or both concave functions. Then for any random
variable $Y$

$\mathrm{cov}(S_1(Y), S_2(Y))\,\mathrm{var}(Y) \ge \mathrm{cov}(Y, S_1(Y))\,\mathrm{cov}(Y, S_2(Y))$

If $Y$ has support at more than 2 points, the inequality is strict provided neither $S_1$ nor $S_2$ is
linear. Taking $S_1(y) = y^2$ and $S_2 = T$, Lemma 2.2 immediately provides a sufficient condition for (2.4) to be positive, thus
proving the theorem.
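Theorem 2.1 can be checked numerically in a small case. The sketch below is an illustration only, under assumptions that are not in the text: the base measure is binomial with index 12 (so $h(y) = C(12, y)$), the convex function is $T(y) = y \log y$ with $T(0) = 0$, and the fixed mean is $\mu_0 = 6$. The helper names `probs`, `mean_var` and `theta_for_mean` are hypothetical. For each $\tau$ the code solves $E(Y \mid \theta, \tau) = \mu_0$ by bisection and records the variance along the fixed-mean curve.

```python
import math

def xlogx(y):
    """Convex T(y) = y log y, with the convention T(0) = 0."""
    return y * math.log(y) if y > 0 else 0.0

def probs(theta, tau, n=12, T=xlogx):
    """Cell probabilities proportional to C(n, y) exp{theta*y + tau*T(y)}, y = 0..n."""
    logw = [math.log(math.comb(n, y)) + theta * y + tau * T(y) for y in range(n + 1)]
    top = max(logw)                      # subtract the max for numerical stability
    w = [math.exp(v - top) for v in logw]
    total = sum(w)
    return [v / total for v in w]

def mean_var(p):
    """Mean and variance of a distribution on 0, 1, ..., len(p) - 1."""
    mean = sum(y * q for y, q in enumerate(p))
    var = sum((y - mean) ** 2 * q for y, q in enumerate(p))
    return mean, var

def theta_for_mean(mu0, tau, lo=-50.0, hi=50.0):
    """Solve E(Y | theta, tau) = mu0 by bisection (the mean increases in theta)."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mean_var(probs(mid, tau))[0] < mu0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

mu0 = 6.0                                # mean of Binomial(12, 1/2)
taus = [0.0, 0.05, 0.1, 0.2, 0.4]
variances = []
for tau in taus:
    theta = theta_for_mean(mu0, tau)
    mean, var = mean_var(probs(theta, tau))
    variances.append(var)
    print(f"tau={tau:4.2f}  theta={theta:8.4f}  mean={mean:6.3f}  var={var:6.3f}")
```

At $\tau = 0$ the variance is that of the binomial base model; the printed variances should then grow with $\tau$ while the mean stays at 6.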

An  alternative  proof  of  Theorem  2.1  arises  from  ideas  contained  in  Shaked  (1980).  We  may 
easily  deduce  the  following  modification  of  his  Theorem  1. 

Lemma 2.3: Consider any pair of distinct densities $f$ and $g$ with respect to some
dominating measure $\nu$. If $f/g$ is convex and $E_f(Y) = E_g(Y)$ then the number of sign changes
for $f - g$ is two and the sequence is $+, -, +$.

The conclusion of Lemma 2.3, again with $E_f(Y) = E_g(Y)$, implies that, provided
expectations exist, $E_f W(Y) \ge E_g W(Y)$ for all real convex $W$. (An elementary proof is given in
Schweder (1982)). Thus by taking $W(y) = y^2$ we have $\mathrm{var}_f(Y) \ge \mathrm{var}_g(Y)$. Theorem 2.1 now
follows by choosing any two members of $\mathcal{P}_{\mu_0}$, identifying as $f$ the one with the larger $\tau$. The
convexity of $T(y)$ implies the convexity of $f/g$.

Indeed the convexity of $T$ yields a somewhat stronger notion of overdispersion than our
definition since, within $\mathcal{P}_{\mu_0}$, both of the proofs imply ordering by $\tau$ of expectations of an
arbitrary convex function, i.e., for any arbitrary convex function $S$,
$(d/d\tau)\,E(S(Y) \mid \theta_{\mu_0}(\tau), \tau) \ge 0$. This inequality allows for comparison of skewness, kurtosis, etc.

Writing Efron’s (1986) family in the form (2.1) reveals $T(y) = y\,\eta(y) - \chi(\eta(y))$ ($\eta(y)$ is
defined below (2.2)). Hence $T'(y) = \eta(y)$, a strictly increasing function, so his $T$ is convex and
his family is contained in (2.1). Lindsay (1986) shows that the family (2.1) arises by suitable
mixing of (2.2) provided $T$ is the log moment generating function of an infinitely divisible
family of distributions, equivalently provided $T'$ is absolutely monotone. If our interest is in
overdispersion, we can allow the wider class of convex $T$’s. This may serve to mitigate his
concern (p.129) regarding finding $T$’s such that $\exp(T(y))$ is integrable with respect to $G$. For
instance $T(y) = y \log y$ is convex but $T'(y)$ is not absolutely monotone on $R^+$. This $T$ is used
by Efron (1986) and in Section 4.4 below for the case when (2.2) is the Poisson family. Note
that if $\exp(y^2)$ is integrable with respect to $G$ we may argue that (2.1) with $T(y) = y^2$
approximates, for $\tau$ small, an arbitrary mixture of (2.2) provided the mixing
distribution has finite second moment. This follows directly from Cox (1983,p.27
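As a worked instance of Efron’s $T$ (a sketch that uses only the definitions of $\chi$ and $\eta$ given below (2.2)), take (2.2) to be the Poisson family, for which $\chi(\theta) = e^{\theta}$ and $\eta(\mu) = \log \mu$:

```latex
\begin{align*}
T(y)  &= y\,\eta(y) - \chi(\eta(y)) = y\log y - e^{\log y} = y\log y - y, \\
T'(y) &= \log y = \eta(y), \qquad T''(y) = 1/y > 0 \quad (y > 0).
\end{align*}
```

The linear term $-y$ can be absorbed into $\theta y$ in (2.1), so $T(y) = y \log y$ generates the same two parameter family.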

The following corollary to Theorem 2.1 quantifies the relative overdispersion of (2.1) to (2.2)
for $\tau$ small.

Corollary 2.1. If in (2.1) $T$ is convex and $\tau$ is small and positive, then

$\mathrm{var}(Y \mid \theta_{\mu_0}(\tau), \tau) / V(\mu_0) = 1 + a\tau + O(\tau^2)$  (2.6)

where $a = g'_{\mu_0}(0) / V(\mu_0) > 0$ with $V(\mu) = (d\eta/d\mu)^{-1}$, the variance function associated with
(2.2).

Proof. Write the numerator of the left hand side of (2.6) as $g_{\mu_0}(\tau) - \mu_0^2$ and expand in a
Taylor series about $\tau = 0$.
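The first-order coefficient in (2.6) can be evaluated from (2.4) and compared with an exact variance ratio. The sketch below is illustrative and rests on assumptions not in the text: a binomial base measure with index 12, $T(y) = y^2$, and $\mu_0 = 6$. With these choices the density is proportional to $C(12, y)\exp\{\tau (y-6)^2\}$ when $\theta = -12\tau$, so symmetry keeps the mean at 6 and no root finding is needed. The helper names are hypothetical.

```python
import math

n = 12
ys = range(n + 1)

def probs(theta, tau):
    """Binomial(12) base measure tilted by exp{theta*y + tau*y^2}."""
    logw = [math.log(math.comb(n, y)) + theta * y + tau * y * y for y in ys]
    top = max(logw)
    w = [math.exp(v - top) for v in logw]
    s = sum(w)
    return [v / s for v in w]

def expect(p, f):
    return sum(f(y) * q for y, q in zip(ys, p))

def cov(p, f, g):
    return expect(p, lambda y: f(y) * g(y)) - expect(p, f) * expect(p, g)

ident = lambda y: y
square = lambda y: y * y
T = square

# Right-hand side of (2.4) at tau = 0, theta = 0, i.e. under Binomial(12, 1/2).
p0 = probs(0.0, 0.0)
var0 = cov(p0, ident, ident)             # V(mu0) for the base model
g_prime0 = cov(p0, square, T) - cov(p0, ident, square) * cov(p0, ident, T) / var0
a = g_prime0 / var0

# By symmetry about y = 6, theta = -12*tau keeps the mean at mu0 = 6 when T(y) = y^2.
tau = 0.001
p_tau = probs(-12.0 * tau, tau)
mean_tau = expect(p_tau, ident)
ratio = cov(p_tau, ident, ident) / var0
print(a, mean_tau, ratio, 1 + a * tau)
```

The last two printed numbers are the exact ratio on the left of (2.6) and its linear approximation; they should differ only at order $\tau^2$.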

We conclude this section by noting that in (2.1) the parameters $\tau$ and $\mu = E_{\theta,\tau}(Y)$ are
orthogonal, i.e. $E\,\partial^2 \log p_{\theta,\tau}(Y)/\partial\mu\,\partial\tau = 0$, as can be verified by direct calculation. See Cox and
Reid (1987) and Barndorff-Nielsen (1978, p.184) for further discussion.

3.  Sampling  Models  and  Asymptotics 

To incorporate sample size into our models, suppose $Z$ is the average over $n$ independent
replications of (2.2). Then by convolution the density of $Z$ becomes

$f_{\theta,n}(z) = \exp\{n(\theta z - \chi(\theta))\}$  (3.1)

with respect to $G_n$, the corresponding convolution measure of $G$. That is, $G_n$ satisfies
$\int \exp(n\theta z)\,dG_n(z) = \exp\{n\chi(\theta)\}$. Treating $n$ in (3.1) as a so-called dispersion parameter by allowing it
to range over the subset of $R^+$ such that $n\chi(\theta)$ is a log moment generating function for some
measure $H_n$, Jorgensen defines (3.1) to be an exponential dispersion model (EDM). Note that
this dispersion parameter is not related to our notion of overdispersion. In fact we wish to
formulate an overdispersed family of models relative to the EDM, (3.1).

Similar convolution of (2.1) does not produce an EDM. Here $\rho(\theta,\tau)$ is the log moment
generating function of $\exp\{\tau T(y)\}\,dG(y)$ but $n\rho(\theta,\tau)$ will generally not be the log moment
generating function of $\exp\{\tau T(y)\}\,dG_n(y)$. Rather the measure $G_n$ will depend upon $\tau$ as well
as $n$. Instead consider the extension of (3.1), paralleling (2.1), to the family of densities with
respect to $G_n$ of the form

$f_{\theta,\tau,n}(z) = \exp\{n\theta z + m_n \tau T(z) - \rho_n(\theta,\tau)\}$  (3.2)

where again $T$ is convex, $\rho_n$ is the normalizing function, and the sequence $m_n > 0$ is to be
determined.


For fixed $n$, hence $m_n$, Theorem 2.1 shows that on curved subfamilies of (3.2) where the
mean, $n^{-1}\,\partial\rho_n/\partial\theta$, is held constant the variance will increase in $\tau$. Thus if we consider
independent replications of (3.2) with $n$ constant (balanced samples), (3.2) serves as an
overdispersed family of models for (3.1) regardless of how $m_n$ is chosen. For unbalanced data
extending (3.1) to (3.2) will lead to $n$ varying over independent observations from (3.2). In this
case we claim that the choice of $m_n$ matters and that $m_n = n$ is the appropriate choice.


Sampling from (3.1) with varying $n$ is interpreted as drawing averages based upon differing
sample sizes. Thus for (3.2) to suitably extend (3.1) the mean should be approximately constant
over $n$. With regard to overdispersion consider the usual mixture model approach. If we mix
(3.1) with some distribution $H$ having mean $\mu_H$ and variance $\sigma_H^2$, the relative overdispersion of
the resulting mixture distribution to (3.1) is $(n^{-1} E\,V(\mu) + \sigma_H^2) / (n^{-1} V(\mu_H))$, which tends to $\infty$ as
$n \to \infty$. That is, since the mixing distribution is assumed not to depend on $n$, taking additional
observations within a population does not increase our knowledge regarding heterogeneity
across populations. (An open extension of Lindsay’s (1986) work is whether (3.1) can be mixed
by a distribution free of $n$ to produce a two-parameter exponential family. Our ensuing
discussion suggests that the answer is no.)
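The growth of the mixture-type relative overdispersion with $n$ is easy to tabulate. The sketch below assumes, for illustration only, a Poisson base family, so that $V(\mu) = \mu$ and $E\,V(\mu) = V(\mu_H) = \mu_H$; the values of $\mu_H$ and $\sigma_H^2$ are arbitrary and the function name is hypothetical. By the law of total variance the mixture of averages has variance $E\,V(\mu)/n + \sigma_H^2$, which is compared with $V(\mu_H)/n$.

```python
def relative_overdispersion(n, mean_V, V_at_mean, sigma2_H):
    """(E V(mu)/n + sigma_H^2) / (V(mu_H)/n) for an average of n observations."""
    return (mean_V / n + sigma2_H) / (V_at_mean / n)

# Poisson case: V(mu) = mu, so E V(mu) = V(mu_H) = mu_H. Values are illustrative.
mu_H, sigma2_H = 2.0, 0.5
ratios = {n: relative_overdispersion(n, mu_H, mu_H, sigma2_H) for n in (1, 10, 100, 1000)}
for n, r in ratios.items():
    print(f"n={n:5d}  relative overdispersion={r:8.2f}")
```

In this case the ratio equals $1 + n\sigma_H^2/\mu_H$, so it increases linearly in $n$ rather than settling at a constant.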

We now argue roughly that regardless of the choice of $m_n$, the models (3.2) cannot produce
the “mixing type” of overdispersion relative to (3.1). They can achieve a limit for the relative
overdispersion which is a constant $> 1$ and this occurs only when $m_n = n$. We note that such a
limit arises in Efron’s (1986) formulation. The discussion by Kent to Jorgensen (1987) alludes
to this difference in “type” of overdispersion.

Suppose $n$ is large with $m_n = n$. Expanding $\rho_n(\theta,\tau)$ about $\tau = 0$ we have for small $\tau$

$\rho_n(\theta,\tau) \approx \rho_n(\theta,0) + \tau\,\partial\rho_n(\theta,\tau)/\partial\tau\,|_{(\theta,0)}$

$= n\chi(\theta) + \tau n\,E_n(T(Z) \mid Z \sim f_{\theta,n})$

$= n(\chi(\theta) + \tau T(\mu)) + O(1)$  (3.3)

where $\mu = \chi'(\theta)$. The last equality follows by expansion of $T(z)$ about $\mu$. In fact we can take
additional terms in the expansion of $\rho_n(\theta,\tau)$ and by similar argumentation eventually assert that
$\rho_n(\theta,\tau)$ can be expressed in the form

$\rho_n(\theta,\tau) = n\,\nu(\theta,\tau) + O(1)$.  (3.4)
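The $O(1)$ remainder in (3.3) can be seen explicitly in a simple case. As an illustration (the choice $T(z) = z^2$ is an assumption made here for convenience), recall that $Z$ is an average of $n$ replications of (2.2), so $E(Z) = \mu$ and $\mathrm{var}(Z) = \chi''(\theta)/n$:

```latex
\begin{align*}
n\,E_n\{T(Z)\} = n\,\{\mu^2 + \chi''(\theta)/n\} = n\,T(\mu) + \chi''(\theta).
\end{align*}
```

Multiplying by $\tau$, the term beyond $n\tau T(\mu)$ is $\tau\chi''(\theta)$, which does not grow with $n$.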


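Then for large $n$ clearly $E_n(Z \mid \theta, \tau) = n^{-1}\,\partial\rho_n(\theta,\tau)/\partial\theta \approx \partial\nu(\theta,\tau)/\partial\theta$,
so that the mean remains roughly constant across $n$.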


In fact for $n$ large, under (3.4), (3.2) will be approximately an EDM and thus enjoy the same
small dispersion asymptotics (see Jorgensen, 1987, p. 135) as EDM’s do. We note that Efron’s
(1986) double exponential family is of the form (3.2) with $m_n = n$ and (3.4) holding. Moreover
we can extend our calculations of Section 2 to this case. The left hand side of (2.5) becomes
$n^{-2}\delta(\theta,\tau) + o(n^{-2})$ where $\delta(\theta,\tau) = \nu^{(2,1)} - \nu^{(3,0)}\,\nu^{(1,1)}/\nu^{(2,0)}$. The relative overdispersion
(2.6) becomes

$1 + \dfrac{(n^{-2}\delta(\theta_0, 0) + o(n^{-2}))\,n\,\tau}{n^{-1}\,\nu^{(2,0)}(\theta_0, 0)} + O(\tau^2)$

which tends to a constant $> 1$. This argument generalizes Efron’s Fact 2 (p. 711).


For a general sequence $m_n$, (3.3) becomes

$n\left(\chi(\theta) + \dfrac{m_n}{n}\,\tau\,T(\mu)\right) + O(1)$.  (3.5)

If $m_n = o(n)$, (3.5) is $n\chi(\theta) + o(n)$, i.e. asymptotically (3.2) behaves like (3.1). If $n = o(m_n)$,
from (3.5) we see that upon differentiation, the mean of (3.2) will not be stable over $n$. In fact it
tends to $\infty$ as $n \to \infty$. Thus $m_n = n$ is the unique choice producing overdispersion of (3.4)
relative to (3.1).


In Section 4 we confine ourselves to independent replications from (3.2) with $n$ constant.
More interesting problems would fit (3.2) allowing $n$, $\theta$, and $\tau$ to vary across independent
observations, expressing, for the $i$th observation, $\theta_i$ and $\tau_i$ through generalized linear models
(see McCullagh and Nelder, 1983). As in Efron (1986), this could be incorporated in our setting.

4.  Test and Displays for Overdispersion

4.1  UMP  Unbiased  Test 

If $Y_1, \ldots, Y_k$ are an independent sample from (3.2) then standard theory gives a UMP
unbiased test for overdispersion. That is, to test $H_0: \tau = 0$ vs. $H_A: \tau > 0$ we reject for

$\sum T(Y_i) > c(\hat\mu)$  (4.1)

where $\hat\mu = \sum Y_i / k$. Recall that the MLEs for $\mu, \tau$ solve $\rho^{(1,0)} = \hat\mu$, $\rho^{(0,1)} = \sum T(Y_i)/k$. Since
$\rho^{(0,1)}$ increases in $\tau$ we must have $\hat\tau$ increasing in $\sum T(Y_i)$ for fixed $\hat\mu$. We may write (4.1) as
$\hat\tau > d(\hat\mu)$. The concluding remark of section 2 shows that for $k$ large $\hat\mu$ and $\hat\tau$ are approximately
independent. Thus the unconditional test based upon the asymptotic normal distribution of $\hat\tau$
under $H_0$ will be approximately UMP unbiased. As Cox and Reid (1987, p.2) note, the
asymptotic standard error of $\hat\tau$ is the same whether $\mu$ is known or not. In the special case
$T(y) = y^2$, (4.1) can be written in the appealing form

$\sum (Y_i - \bar Y)^2 > c(\bar Y)$  (4.2)

capturing our informal notion of overdispersion. In fact under the family (3.2), (4.2) is
essentially Cox’s (1983, p. 272) test for overdispersion.

4.2  Graphical  Displays 

We  consider  the  case  where  (3.2)  is  a  distribution  on  the  nonnegative  integers.  Extension  to 
continuous  distributions  could  be  similarly  developed  by  partitioning  the  domain  of  Y  into 
intervals.  Our  approach  has  its  roots  in  the  work  of  Gart  (1969)  and  Ord  (1970).  These  papers 
investigate  standard  overdispersion  cases  e.g.  Beta  Binomial  to  Binomial,  Negative  Binomial  to 
Poisson,  Binomial  to  Hypergeometric.  Suppose  then  the  class  of  models 


$p_{\theta,\tau}(y) = h(y)\exp\{\theta y + \tau T(y) - \rho(\theta,\tau)\}$,  $y = 0, 1, 2, \ldots$  (4.3)

with $p_{\theta,\tau}(y)$ a density with respect to counting measure. How might we develop a display to see the
presence of overdispersion and to suggest a good $T$? We do not view this as an optimality
problem. Allowing varying convex $T$’s in (4.3) would, for appropriate $\theta, \tau$, yield comparably
fitting models. If $Y_1, \ldots, Y_k$ are a sample from (4.3) define $\hat p_y$ to be the observed proportion of
$Y$’s equal to $y$. Lindsay (1986) suggests examination of the log residual $r_y = \log(\hat p_y / p_{\hat\theta,0}(y))$
where $\hat\theta$ is the MLE under (4.3) when $\tau = 0$. Suppose $\mu_0$, $\tau_0$, $\theta_{\mu_0}(\tau_0)$ are the true parameter
values. Since $\hat p_y = p_{\theta_{\mu_0}(\tau_0),\tau_0}(y) + O_p(k^{-1/2})$ and $\hat\theta = \theta_{\mu_0}(0) + O_p(k^{-1/2})$ we can show that

$r_y = (\theta_{\mu_0}(\tau_0) - \theta_{\mu_0}(0))\,y + \tau_0 T(y) + \chi(\theta_{\mu_0}(0)) - \rho(\theta_{\mu_0}(\tau_0), \tau_0) + O_p(k^{-1/2})$.  (4.4)


If we could remove the linear term in (4.4) we might more easily see whether $\tau > 0$, i.e. see the
presence of overdispersion. Let

$s_y = \log\left(\hat p_{y+1}\,h(y) / (\hat p_y\,h(y+1))\right)$.

Then analogous to (4.4) we can show that

$s_y = \theta_{\mu_0}(\tau_0) + \tau_0\,[T(y+1) - T(y)] + O_p(k^{-1/2})$  (4.5)

The linear term has been removed. Since, for $T$ convex, $T(y+1) - T(y)$ increases in $y$, $s_y$
should be increasing in $y$ if overdispersion is present. An analogy here to unadjusted and
adjusted (or partial) residual plots is noteworthy. A plot of $r_y$ vs $y$ corresponds to the former; a
plot of $s_y$ vs $y$ to the latter. The latter display should be more effective in seeing overdispersion.
Moreover since $T(y+1) - T(y)$ behaves like $T'(y)$ this plot allows us to readily see trends in $T'$
and thus to suggest candidate $T$’s. Note that because $s_y$ is a function of two dependent random
variables (as is $r_y$) successful use of such displays will require $k$ large and suitable truncation of
$y$. Since our effort here is exploratory this should not cause concern.
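The quantities behind the display can be computed directly from cell counts. The sketch below uses the sibship counts of Table 1 and assumes a binomial base measure with $h(y) = C(12, y)$, which matches the initial binomial model used for those data; the variable names are illustrative. It evaluates $\log(\hat p_{y+1} h(y) / (\hat p_y h(y+1)))$ for each adjacent pair of cells.

```python
import math

# Observed counts of y males in 6115 sibships of size 12 (Table 1).
counts = [3, 24, 104, 286, 670, 1033, 1343, 1112, 829, 478, 181, 45, 7]
n = 12

def h(y):
    """Binomial base measure h(y) = C(12, y)."""
    return math.comb(n, y)

# s_y = log( p_{y+1} h(y) / (p_y h(y+1)) ); the sample size cancels in the ratio.
s = [math.log(counts[y + 1] * h(y) / (counts[y] * h(y + 1))) for y in range(n)]
for y, value in enumerate(s):
    print(f"y={y:2d}  s_y={value:7.4f}")
```

Plotting these values against $y$ gives the kind of display described above; an upward trend is what a convex $T$ with $\tau > 0$ would produce.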

4.3  Fitting the Model

Fitting of models (4.3) and the goodness of such fits is discussed extensively in Lindsay. In
general ML estimation is unattractive because the normalizing function is usually not available
in closed form. Weighted least squares is a straightforward alternative. We have investigated
both $\ell_y = \log(\hat p_y / h(y))$ and $s_y$ (Lindsay uses $r_y$). Let $m + 1$ be the smallest value of $y$ such that
$\hat p_y = 0$. For $\ell_y$ we minimize

$\sum_{y=0}^{m} w_y\,(\ell_y - (\theta y + \tau T(y) + c))^2$  (4.6)

over $\theta$, $\tau$ and $c$ where $w_y = \hat p_y / (1 - \hat p_y)$. For $s_y$ we minimize

$\sum_{y=0}^{m-1} w_y\,(s_y - (\theta + \tau\,(T(y+1) - T(y))))^2$  (4.7)

over $\theta$ and $\tau$ where $w_y = \hat p_{y+1}\hat p_y / (\hat p_{y+1} + \hat p_y)$. The weight $w_y$ is (up to a constant) the reciprocal
of the estimated variance of $\ell_y$ and $s_y$ respectively. We ignored the covariances in the fitting on
two grounds. First, Lindsay’s theoretical work (Theorem 4.1) shows that the least squares
estimates resulting from (4.6) are asymptotically efficient if the domain of (4.3) is bounded.
Second, for the two data sets in section 4.4 the full covariance matrix amongst the $\ell_y$ or amongst
the $s_y$ can be estimated by the delta method. For both data sets, for $\ell_y$ and for $s_y$, the diagonal
terms dominated the estimated inverse.

When using $s_y$, $\hat\theta$ and $\hat\tau$ immediately provide estimates of $p_{\theta,\tau}(y)$ in (4.3) up to the
normalizing constant. This constant is then computed terminally to correctly standardize the
fitted cell probabilities. When using $\ell_y$, $\hat c$ estimates the normalizing constant but in fact once $\hat\theta$
and $\hat\tau$ were obtained we ignored $\hat c$. Rather we again calculated $c$ terminally to standardize.
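The weighted least squares fit based on adjacent-cell log ratios can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it uses the sibship counts of Table 1, assumes $h(y) = C(12, y)$ and $T(y) = y^2$, regresses $s_y$ on $T(y+1) - T(y)$ with weights $\hat p_{y+1}\hat p_y/(\hat p_{y+1} + \hat p_y)$, and then normalizes the fitted cell probabilities at the end. Since details of the original computation are not given, exact agreement with the reported estimates should not be expected.

```python
import math

counts = [3, 24, 104, 286, 670, 1033, 1343, 1112, 829, 478, 181, 45, 7]  # Table 1
n, k = 12, sum(counts)
p = [c / k for c in counts]
T = lambda y: y * y                        # T(y) = y^2, an assumption of this sketch

ys = range(n)                              # s_y is defined for y = 0, ..., 11
s = [math.log(p[y + 1] * math.comb(n, y) / (p[y] * math.comb(n, y + 1))) for y in ys]
x = [T(y + 1) - T(y) for y in ys]          # regressor multiplying tau
w = [p[y + 1] * p[y] / (p[y + 1] + p[y]) for y in ys]   # inverse-variance weights

# Closed-form weighted least squares for s_y = theta + tau * x_y.
sw = sum(w)
xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
sbar = sum(wi * si for wi, si in zip(w, s)) / sw
tau_hat = (sum(wi * (xi - xbar) * (si - sbar) for wi, xi, si in zip(w, x, s))
           / sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x)))
theta_hat = sbar - tau_hat * xbar

# Normalize "terminally": exponentiate the fitted linear predictor and rescale.
unnorm = [math.comb(n, y) * math.exp(theta_hat * y + tau_hat * T(y)) for y in range(n + 1)]
total = sum(unnorm)
fitted = [u / total for u in unnorm]
print(theta_hat, tau_hat)
```

The closed-form solution is used here only because the regression has a single covariate; a general weighted least squares routine would serve equally well.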

Comparison between observed and fitted was done using Pearson’s chi square statistic.
When using $\ell_y$ we have $m + 1$ cells with 3 parameters; when using $s_y$ we have $m$ cells with 2
parameters. Thus in either case $m - 2$ degrees of freedom are associated with the goodness of fit
statistic. Intuitively we might expect poorer fitting using the $s_y$. They are the log of a ratio of
random variables and would thus be expected to be more variable than the $\ell_y$. This is perhaps
borne out for the second data set in section 4.4.

4.4  Two  Examples 

The data in Table 1 is taken from Sokal and Rohlf (1973, p.67) and has been examined by
e.g. Shaked (1980). It consists of the frequency of males in 6115 sibships of size 12 in Saxony,
1876-85. Taking an initial binomial model the overall $\hat p = .519$, $n\hat p(1 - \hat p) = 2.996$ while
$S^2 = 3.490$, suggesting overdispersion. The UMPU test for overdispersion using (4.2) with a
normal approximation is extremely significant. Figure 1 plots $s_y$ vs $y$ revealing the expected
increasing pattern. In fact since Figure 1 reveals a roughly linear relationship, $T(y) = y^2$ is
suggested. Fitting using $\ell_y$ produced $\hat\theta = -.2585$, $\hat\tau = .0265$; the fitted probabilities are given
under $\hat p_y^{(1)}$ in Table 1. Fitting using $s_y$ produced $\hat\theta = -.2463$, $\hat\tau = .0260$; the fitted probabilities are
given under $\hat p_y^{(2)}$ in Table 1. The fits are very close and both are excellent. For $\hat p_y^{(1)}$, $\chi^2 = 15.41$;
for $\hat p_y^{(2)}$, $\chi^2 = 14.54$, with d.f. = 10.
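The summary statistics quoted for the sibship data can be recomputed from the counts in Table 1. The short sketch below does this; the variable names are illustrative and the $k - 1$ divisor for the sample variance is an assumption, since the text does not say which divisor was used.

```python
counts = [3, 24, 104, 286, 670, 1033, 1343, 1112, 829, 478, 181, 45, 7]  # Table 1
n = 12                                   # sibship size
k = sum(counts)                          # number of sibships
mean = sum(y * c for y, c in enumerate(counts)) / k
p_hat = mean / n                         # overall proportion of males
binomial_var = n * p_hat * (1 - p_hat)   # variance implied by the binomial model
s2 = sum(c * (y - mean) ** 2 for y, c in enumerate(counts)) / (k - 1)
print(k, p_hat, binomial_var, s2)
```

The sample variance exceeds the binomial variance at the same mean, which is the informal signal of overdispersion referred to above.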

The data in Table 2, originally collected by Thyrion (1961), is taken from Seal (1969) and has
been analyzed by Lindsay (1986) and others. It consists of observed counts of accidents in a
year for 9461 Belgian drivers. Taking an initial Poisson model, the sample mean is .214 with $S^2 = .289$,
suggesting overdispersion. Figure 2 plots $s_y$ vs. $y$ supporting this. In this situation integrable
choices for $T(y)$ are limited. For example $T(y) = y^2$ or $T(y) = (y+1)\log(y+1)$ are not.
$T_1(y) = e^{-y}$ was used by Lindsay. We use $T_2(y) = y \log y$ which arises from Efron’s (1986)
"double Poisson" example. Figure 2 suggests that $T'$ is possibly concave, which is satisfied by
both $T_1$ and $T_2$. (Experimentation not presented shows that for $T_2$, (4.3) resembles, except in
the far tails, a negative binomial distribution). Fitting using $\ell_y$ produced $\hat\theta = -1.833$, $\hat\tau = .7546$;
the fitted probabilities are given under $\hat p_y^{(1)}$ in Table 2. Fitting using $s_y$ produced
$\hat\theta = -1.792$, $\hat\tau = .6322$; the fitted probabilities are given under $\hat p_y^{(2)}$ in Table 2. The fits are
similar. The goodness of fit test collapsing $y \ge 5$ has 3 d.f. For $\hat p_y^{(1)}$, $\chi^2 = 25.92$; for
$\hat p_y^{(2)}$, $\chi^2 = 40.76$. While Lindsay’s fits (p.131) appear to be better, the comparison is unfair since
he has really employed a 3 parameter model. In any event our discussion shows that Efron’s
family may not be adequate and that allowing more general convex $T$ increases Lindsay’s
possibilities.
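Fitted cell probabilities follow from the reported estimates once the normalizing constant is computed numerically. The sketch below does this for the second fit of the accident data, assuming a Poisson base measure $h(y) = 1/y!$ and $T_2(y) = y \log y$ with $T_2(0) = 0$; the truncation point `Y_MAX` and the helper names are assumptions of the sketch, not part of the text.

```python
import math

theta_hat, tau_hat = -1.792, 0.6322        # estimates reported for the s_y fit

def T2(y):
    """T_2(y) = y log y with T_2(0) = 0."""
    return y * math.log(y) if y > 0 else 0.0

def log_weight(y):
    # log of h(y) exp{theta*y + tau*T(y)} with Poisson base h(y) = 1/y!
    return -math.lgamma(y + 1) + theta_hat * y + tau_hat * T2(y)

Y_MAX = 100                                # truncation point, an assumption of this sketch
weights = [math.exp(log_weight(y)) for y in range(Y_MAX + 1)]
total = sum(weights)                       # numerical normalizing constant
fitted = [wgt / total for wgt in weights]
for y in range(8):
    print(f"y={y}  fitted={fitted[y]:.4f}")
```

The leading values can be compared with the last column of Table 2; the terms beyond the first dozen cells are negligible, so the truncation has no visible effect.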

Table 1

Sibship data (Sokal and Rohlf, 1973) with fitted probabilities

  y    Observed counts    $\hat p_y$    $\hat p_y^{(1)}$    $\hat p_y^{(2)}$
  0            3          ...       0.0004    ...
  1           24          0.0039    0.0037    0.0038
  2          104          0.0170    0.0171    0.0177
  3          286          0.0468    0.0508    0.0520
  4          670          0.1096    0.1073    0.1088
  5         1033          0.1689    0.1696    0.1706
  6         1343          0.2196    0.2058    0.2057
  7         1112          0.1818    0.1934    0.1922
  8          829          0.1356    0.1395    0.1380
  9          478          0.0782    0.0754    0.0743
 10          181          0.0296    0.0290    0.0285
 11           45          0.0074    0.0071    0.0070
 12            7          0.0011    0.0008    0.0008

Table 2

Accident data (Seal, 1969) with fitted probabilities

  y    Observed counts    $\hat p_y$    $\hat p_y^{(1)}$    $\hat p_y^{(2)}$
  0         7840          0.8287    0.8286    0.8282
  1         1317          ...       ...       0.1380
  2          239          0.0253    0.0302    0.0276
  3           42          0.0044    0.0068    0.0051
  4           14          0.0015    0.0015    0.0009
  5            4          0.0004    0.0003    0.0001
  6            4          0.0004    0.0001    0.0
  7            1          0.0001    0.0       0.0